Documentation for WGR #235
Conversation
Add Leland's demo notebook
…or WGR (#2)
* blocks
* test vcf
* transformer
* remove extra
* refactor and conform with ridge namings
* test
* test files
* remove extra file
* sort_key

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
* feat: ridge models for wgr added
* Doc strings added for levels/functions.py; some typos fixed in ridge_model.py
* ridge_model and RidgeReducer unit tests added
* RidgeRegression unit tests added; test data README added; ridge_udfs.py docstrings added
* Changes made to accessing the sample ID map, and more docstrings. The map_normal_eqn and score_models functions previously expected the sample IDs for a given sample block to be found in the Pandas DataFrame, which meant we had to join them on before the .groupBy().apply(). These functions now expect the sample-block-to-sample-IDs mapping to be provided separately as a dict, so that the join is no longer required. RidgeReducer and RidgeRegression APIs remain unchanged. Docstrings have been added for the RidgeReducer and RidgeRegression classes.
* Refactored object names and comments to reflect new terminology: where 'block' was previously used to refer to the set of columns in a block, we now use 'header_block'; where 'group' was previously used to refer to the set of samples in a block, we now use 'sample_block'.

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>
* WIP
* existing tests pass
* rename file
* Add compat test
* scalafmt
* collect minimal columns
* address comments
* Test fixup
* Spark 3 needs more recent PyArrow; reduce memory consumption by removing unnecessary caching
* PyArrow 0.15.1 only with PySpark 3
* Don't use toPandas()
* Upgrade pyarrow
* Only register once
* Minimize memory usage
* Select before head
* set up/tear down
* Try limiting pyspark memory
* No teardown
* Extend timeout

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* WIP
* existing tests pass
* rename file
* Add compat test
* scalafmt
* collect minimal columns
* start changing for readability
* use input label ordering
* rename create_row_indexer
* undo column sort
* change reduce
* further simplify reduce
* sorted alpha names
* remove ordering
* comments
* Set arrow env var in build
* faster sort
* add test file
* undo test data change
* >=
* formatting
* empty

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Henry D <henrydavidge@gmail.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>
* yapf
* yapf transform
* Set driver memory
* Try changing spark mem
* match java tests
* whoops
* remove driver memory flag

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* cleanup
* whoops
* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* WIP
* WIP
* WIP
* WIP
* WIP
* whoops
* tests
* simplify tests
* WIP
* yapf
* index map compat
* Add docs
* Add more tests
* pass args as ints
* Don't roll our own splitter
* rename sample_index to sample_blocks

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Add type-checking to APIs
* Check valid alphas
* check 0 sig
* Add to install_requires list
* cleanup comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Added necessary modifications to accommodate covariates in model fitting.

  The initial formulation of the WGR model assumed a form y ~ Xb; however, in general we would like to use a model of the form y ~ Ca + Xb, where C is a matrix of covariates separate from the genomic features X. This PR makes numerous changes to accommodate the covariate matrix C. Adding covariates required the following breaking changes to the APIs: indexdf is now a required argument for RidgeReducer.transform() and RidgeRegression.transform():
  - RidgeReducer.transform(blockdf, labeldf, modeldf) -> RidgeReducer.transform(blockdf, labeldf, indexdf, modeldf)
  - RidgeRegression.transform(blockdf, labeldf, model, cvdf) -> RidgeRegression.transform(blockdf, labeldf, indexdf, model, cvdf)

  Additionally, the function signatures for the fit and transform methods of RidgeReducer and RidgeRegression have all been updated to accommodate an optional covariate DataFrame as the final argument. Two new tests have been added to test_ridge_regression.py to test run modes with covariates:
  - test_ridge_reducer_transform_with_cov
  - test_two_level_regression_with_cov
* Cleaned up one unnecessary Pandas import
* Small changes for clarity and consistency with the rest of the code
* Forgot one usage of coalesce
* Added a couple of comments to explain logic and replaced usages of .values with .array
* Fixed one instance of the .values -> .array change where it was made in error
* Typo in test_ridge_regression.py
* Style auto-updates with yapfAll

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>
Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>
* WIP
* Clean up tests
* WIP
* Order to match labeldf
* Check we tie-break
* cleanup
* tests
* test var name
* clean up tests
* Clean up docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>
…wgr-docs
* Rename levels to wgr
* rename test files

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* headers
* executable
* fix template rendering
* yapf
…wgr-docs
…-docs
Codecov Report

@@           Coverage Diff           @@
##           master     #235   +/-   ##
=======================================
  Coverage   93.75%   93.75%
=======================================
  Files          90       90
  Lines        4339     4339
  Branches      406      406
=======================================
  Hits         4068     4068
  Misses        271      271
=======================================
…-docs
williambrandler left a comment:
some comments and clarifications!
> The genotype data may be read from any variant datasource supported by Glow, such as VCF, BGEN, or PLINK. The DataFrame must also include a column ``values`` containing a numeric representation of each genotype. The genotypic values may not be missing, or equal for every sample in a variant.
what does equal mean here? All homozygous reference?
Mathematically, we're trying to filter out variants for which all samples have the same calls and therefore ``values`` has a variance/stddev of 0 (e.g., all hom-ref, all hom-alt, or even all het). I'm not sure what the best way to phrase this is.
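The zero-variance condition discussed here can be illustrated with a minimal numpy sketch (this is plain numpy on a made-up matrix, not Glow's actual filtering code):

```python
import numpy as np

# Hypothetical genotype value matrix: rows are variants, columns are samples.
# 0 = hom-ref, 1 = het, 2 = hom-alt.
genotypes = np.array([
    [0, 1, 2, 1],  # calls differ across samples: informative, kept
    [1, 1, 1, 1],  # all het: stddev is 0, so the variant is dropped
    [0, 0, 0, 0],  # all hom-ref: stddev is 0, so the variant is dropped
])

# Keep only variants whose values vary between at least two samples.
keep_mask = genotypes.std(axis=1) > 0
filtered = genotypes[keep_mask]
print(filtered)  # only the first row survives
```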
> - Split multiallelic variants with the ``split_multiallelics`` transformer.
> - Calculate the number of alternate alleles for biallelic variants with ``glow.genotype_states``.
> - Replace any missing values with the mean of the non-missing values using ``glow.mean_substitute``.
> - Filter out all homozygous SNPs.
Filter out all SNPs that contain zero non-reference alleles
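The mean-substitution step from the quoted list above can be sketched in plain numpy (a toy illustration of the logic, not Glow's ``mean_substitute`` implementation; the NaN encoding of missing calls is assumed for the example):

```python
import numpy as np

# Hypothetical values for a single biallelic variant across five samples,
# with one missing call encoded as NaN.
values = np.array([0.0, 2.0, np.nan, 1.0, 1.0])

# Replace missing entries with the mean of the non-missing values.
mean = np.nanmean(values)                      # (0 + 2 + 1 + 1) / 4 = 1.0
imputed = np.where(np.isnan(values), mean, values)
print(imputed)  # [0. 2. 1. 1. 1.]
```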
> The fields in the model DataFrame are:
>
> - ``header_block``: An ID assigned to the block x0 corresponding to the coefficients in this row.
> - ``sample_block``: An ID assigned to the block x0 corresponding to the coefficients in this row.
header_block and sample_block have the same description?
…-docs
williambrandler left a comment:
Is it worth having a comment up front that GlowGR only supports quantitative phenotypes for now, and we plan to implement binary traits in the near future?
Otherwise LGTM
I added a note that this only supports quantitative phenotypes. I'm going to avoid making promises in our docs.
henrydavidge left a comment:
Looks awesome! Thanks @karenfeng !
What changes are proposed in this pull request?
Creates documentation for WGR.
How is this patch tested?